run to run scan warpspeed impl sm100+#9263
Conversation
|
Ready to act? Review this PR in Change Stack to turn feedback into patch suggestions you can inspect and refine. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
OverviewThis PR implements run-to-run support for the warpspeed scan optimization on SM100+ targets, enabling deterministic DeviceScan execution. The changes introduce a stable reduction order variant of the warpspeed lookahead logic and thread this stability setting through the scan dispatch pipeline. ChangesWarpspeed Lookahead Stable VariantAdded
Warpspeed Scan Pipeline IntegrationExtended the warpspeed scan implementation in
Kernel Dispatch UpdateUpdated Policy Selection for Stable ReductionModified Related IssueCloses WalkthroughAdds a deterministic ChangesStable Warpspeed Scan Implementation
Assessment against linked issues
Possibly related PRs
Suggested reviewers
important: Confirm that warpspeed stable lookahead updates important: Verify the suggestion: Consider adding a static_assert or comment that documents Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cub/cub/device/dispatch/tuning/tuning_scan.cuh (1)
1038-1047:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winsuggestion: Update the inline rationale for the
require_stable_reduction_order→cc >= {10, 0}gate:warpIncrementalLookaheadStableis available for__cccl_ptx_isa >= 860(sm_90+), but the scan policy selector only produces ascan_warpspeed_policywhencc >= {10, 0}(otherwiseget_warpspeed_policyreturns{}), so stable warpspeed on sm_90+ is blocked by warpspeed policy/tuning availability—not by stable lookahead codegen availability.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dfcdb20c-106f-4ae5-a688-9e19e5475411
📒 Files selected for processing (4)
cub/cub/detail/warpspeed/look_ahead.cuhcub/cub/device/dispatch/kernels/kernel_scan.cuhcub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuhcub/cub/device/dispatch/tuning/tuning_scan.cuh
|
/ok to test cbd13bb |
🥳 CI Workflow Results🟩 Finished in 2h 26m: Pass: 100%/284 | Total: 11d 15h | Max: 2h 26m | Hits: 18%/1000913See results here. |
|
pre-commit.ci autofix |
| const ::cuda::std::uint32_t lanemaskEq = ::cuda::ptx::get_sreg_lanemask_eq(); | ||
|
|
||
| // Adjust the left pointer down to the nearest 32-multiple so we do batched sums | ||
| int idxTileCur = (idxTilePrev / 32) * 32; |
There was a problem hiding this comment.
Suggestion: Use cuda::round_down.
| AccumT aggrExclusiveCtaCur = aggrExclusiveCtaPrev; | ||
|
|
||
| using warp_reduce_t = WarpReduce<AccumT>; | ||
| static_assert(sizeof(typename warp_reduce_t::TempStorage) <= 4, |
There was a problem hiding this comment.
Why 4? I assume this is sizeof(uint32_t)? If so, best to say sizeof(uint32_t) instead (or better yet, refer to an actual type/value so that when that size is changed, the check automatically is as well).
There was a problem hiding this comment.
Because the TempStorage is a struct with further nested types that have no value, but because there are data members it has a size of 1. For some reason @elstehle chose 4 here, but the check is basically that no temporary storage is required. Btw, is_empty also does not work here.
There was a problem hiding this comment.
Could you please put this as a comment then in the src? 4 is quite a magic value to capture this, I would have expected 1 or something like that then
| [[maybe_unused]] typename warp_reduce_t::TempStorage temp_storage; | ||
|
|
||
| using warp_reduce_or_t = WarpReduce<::cuda::std::uint32_t>; | ||
| typename warp_reduce_or_t::TempStorage temp_storage_or; |
There was a problem hiding this comment.
Nit: typename is not needed here I think. WarpReduce<uint32_t> is not dependent on any of your template params.
Description
closes #7556
Checklist